Efficient supervised learning in networks with binary synapses 
Recent experimental studies indicate that synaptic changes induced by neuronal activity are
discrete jumps between a small number of stable states. Learning in systems with discrete synapses
is known to be a computationally hard problem. Here, we study a neurobiologically plausible on-line
learning algorithm that derives from Belief Propagation algorithms. We show that it performs
remarkably well in a model neuron with binary synapses, and a finite number of 'hidden' states per
synapse, that has to learn a random classification task. Such a system is able to learn a number of
associations close to the theoretical limit, in a time that is sublinear in system size. This is, to our
knowledge, the first on-line algorithm that is able to efficiently achieve a finite number of patterns
learned per binary synapse. Furthermore, we show that performance is optimal for a finite number of
hidden states, which becomes very small for sparse coding. The algorithm is similar to the standard
'perceptron' learning algorithm, with an additional rule for synaptic transitions which occur only if
a currently presented pattern is 'barely correct'. In this case, the synaptic changes are meta-plastic
only (change in hidden states and not in actual synaptic state), stabilizing the synapse in its current
state. Finally, we show that a system with two visible states and K hidden states is much more
robust to noise than a system with K visible states. We suggest this rule is sufficiently simple to be
easily implemented by neurobiological systems or in hardware.



I. INTRODUCTION 



Learning and memory are widely believed to occur through mechanisms of synaptic plasticity. In spite of a huge
amount of experimental data documenting various forms of plasticity, such as long-term potentiation (LTP) and
long-term depression (LTD), the mechanisms by which a synapse changes its efficacy, and those by which it can maintain
these changes over time, remain unclear. Recent experiments have suggested that single synapses could be similar to noisy
binary switches [1]. Bistability could in principle be induced by positive feedback loops in protein interaction
networks of the post-synaptic density [1, 0, [1]. Binary synapses would have the advantage of robustness to noise
and hence could preserve memory over long time scales, compared to analog systems which are typically much more
sensitive to noise.

Many neural network models of memory use binary synapses to store information. In some
of these network models, learning occurs in an unsupervised way. From the point of view of a single synapse, this
means that transitions between the two synaptic states (a state of low or zero efficacy, and a state of high efficacy)
are induced by pre- and post-synaptic activity alone. Tsodyks [14] and Amit and Fusi have shown that the
performance of such systems (in terms of information stored per synapse) is very poor, unless two conditions are met:
(1) activity in the network is sparse (a very low fraction of neurons is active at a given time); and (2) transitions are
stochastic, with on average a balance between up (LTP-like) and down (LTD-like) transitions. This poor performance
has motivated further studies [12] in which hidden states are added to the synapse in order to provide it with a
multiplicity of time scales, allowing for both fast learning and slow forgetting.

In a supervised learning scenario, synaptic modifications are induced not only by the activity of pre- and post-synaptic
neurons but also by an additional 'teacher' or 'error' signal which gates the synaptic modifications. The
prototypical network in which this type of learning has been studied is the one-layer perceptron, which has to perform
a set of input-output associations, i.e. learn to classify input patterns correctly into two classes. In the case of analog
synapses, algorithms are known to converge to synaptic weights that solve the task, provided such weights exist
[15, 16]. On the other hand, no efficient algorithms are known for a perceptron with binary synapses (or more generally
synapses with a finite number of states) when the number of patterns to be learned scales with the number of
synapses. In fact, studies on the capacity of binary perceptrons used complete enumeration schemes in order
to determine the capacity numerically. These studies found a capacity of about 0.83 bits per synapse in the random
input-output categorization task, very close to the theoretical upper bound of 1 bit per synapse. However, it is not
even clear whether there exist efficient algorithms that can reach a finite capacity per synapse in the limit of a large
network size N. Indeed, learning in such systems is known to be an NP-complete task [22, 23].

Recently, 'message passing' algorithms have been devised that efficiently solve non-trivial random instances of
NP-complete optimization problems, such as K-satisfiability or graph coloring. One such algorithm,
Belief Propagation (BP), has been applied to the binary perceptron problem and has been shown to be able to
efficiently find synaptic weight vectors that solve the classification problem for a number of patterns close to the maximal
capacity (above 0.7 bits per synapse) [28]. However, this algorithm has a number of biologically unrealistic features
(e.g. memory stored in several analog variables). Here, we explore algorithms that are inspired by the BP algorithm
but are modified in order to make them biologically realistic.

The paper is organized as follows: First we present the general scheme for the simplest setup of ±1 patterns and 
synapses as well as results with bounded and unbounded hidden variables. Then we discuss the more realistic 0,1 
case with results including the sparse coding limit. Implications of our results are discussed in the concluding section. 
Details are given in the Supporting Information. 



II. BINARY ± 1 NEURONS AND SYNAPSES 



A. The model neuron 



We consider a neuron with two states ('inactive' and 'active') together with its N presynaptic inputs, which we
take to be also binary. Depending on the time scale, these two states could correspond in a biological neuron either
to emission of a single spike or not, or to elevated persistent activity or not. The strength of the synaptic weight from
presynaptic neuron $i$ ($i = 1, \dots, N$) is denoted by $w_i$. Given an input pattern of activity $\{\xi_i, i = 1, \dots, N\}$, the
total synaptic input received by the neuron is $I = \sum_{i=1}^{N} w_i \xi_i$. The neuron is active if this total input is larger than
a threshold $\theta$, and is inactive otherwise. Such a model neuron is sometimes called a perceptron [15]. In this paper
we consider binary synaptic weights. In addition, each synapse is characterized by a discrete 'hidden variable' that
determines the value of the synaptic weight. In this section we consider $\{-1,+1\}$ neurons and synapses, and $\theta = 0$;
in order to simplify the notation, we will also assume N to be odd, so that the total synaptic input is never equal to
0. This assumption can be dropped when dealing with $\{0, 1\}$ model neurons.
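As an illustration, the model neuron described above can be sketched in a few lines of Python. The function names `total_input` and `neuron_output` are hypothetical and chosen here for readability; the sketch assumes ±1 weights and inputs, threshold 0, and odd N.

```python
import random

def total_input(w, xi):
    """Total synaptic input I = sum_i w_i * xi_i."""
    return sum(wi * x for wi, x in zip(w, xi))

def neuron_output(w, xi, theta=0):
    """Return +1 (active) if the total input exceeds theta, else -1 (inactive)."""
    return 1 if total_input(w, xi) > theta else -1

# Illustrative values only: N is odd, so a sum of N terms equal to +/-1 is never 0.
rng = random.Random(0)
N = 11
w = [rng.choice((-1, 1)) for _ in range(N)]
xi = [rng.choice((-1, 1)) for _ in range(N)]
I = total_input(w, xi)
```

Because every term $w_i \xi_i$ is ±1, the total input has the same parity as N, which is why an odd N rules out $I = 0$.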



B. The classification problem 

We assume that our model neuron has to classify a set of $p = \alpha N$ random input patterns $\{\xi_i^a, i = 1, \dots, N, a = 1, \dots, p\}$
into two classes (active or inactive neuron, $\sigma^a = \pm 1$). The set of patterns which should be classified as
$+1$ ($-1$) is denoted by $\Xi_+$ ($\Xi_-$) respectively. In each pattern, the activity of input neurons is set to 1 or $-1$ with
probability 0.5, independently from neuron to neuron and pattern to pattern. The learning process consists in finding
a vector of synaptic weights $w$ such that all patterns in $\Xi_+$ (resp. $\Xi_-$) are mapped onto output $+1$ (resp. $-1$). Hence,
the vector $w$ has to satisfy the p equations

$\sigma^a = \mathrm{sign}\left( \sum_{i=1}^{N} w_i \xi_i^a \right)$ for $a = 1, \dots, p$. (1)

We will call such a vector a perfect classifier for this problem.

In the case of ±1 synapses and inputs we are considering in this section, the problem is the same if we consider the
set $\Xi_-$ to be empty, i.e. $\sigma^a = +1$ for all $a = 1, \dots, p$, as we can always redefine $\xi_i^a \to \sigma^a \xi_i^a$ and require the output to
be positive. This will no longer hold in the next section.
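The classification task and the redefinition trick can be made concrete with a short sketch. The helper names (`random_patterns`, `is_perfect_classifier`, `fold_outputs`) are hypothetical; the code only checks condition (1) and shows that multiplying each pattern by its desired output leaves the problem unchanged.

```python
import random

def sign(x):
    return 1 if x > 0 else -1

def random_patterns(p, N, rng):
    """Draw p unbiased +/-1 patterns of size N and p desired outputs."""
    xis = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(p)]
    sigmas = [rng.choice((-1, 1)) for _ in range(p)]
    return xis, sigmas

def is_perfect_classifier(w, xis, sigmas):
    """Check Eq. (1): sigma^a = sign(sum_i w_i xi_i^a) for every pattern a."""
    return all(
        sign(sum(wi * x for wi, x in zip(w, xi))) == s
        for xi, s in zip(xis, sigmas)
    )

def fold_outputs(xis, sigmas):
    """Redefine xi_i^a -> sigma^a * xi_i^a, so every desired output becomes +1."""
    return [[s * x for x in xi] for xi, s in zip(xis, sigmas)]
```

After folding, a weight vector is a perfect classifier exactly when every folded pattern gives a positive total input (N odd is assumed, so the input is never zero).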



C. The perceptron learning algorithm 

In the case of unbounded synaptic weights, there exists a standard learning algorithm that can find a perfect
classifier, provided such a classifier exists, namely the perceptron algorithm (SP) [15, 16]. The algorithm consists in
presenting the input patterns sequentially. When at time $\tau$ pattern $\xi^\tau$ is presented, one first computes the total input
$I = \sum_{i=1}^{N} w_i^\tau \xi_i^\tau$ and then:

• If $I > 0$: do nothing.

• If $I < 0$: change the synaptic weights as follows: $w_i^{\tau+1} = w_i^\tau + \xi_i^\tau$.

This algorithm has the nice feature that it is guaranteed to converge in a finite time if a solution to the classification
problem exists. Furthermore, it has other appealing features that make it a plausible candidate for a neurobiological
system: the only information needed to update the synapse is an 'error signal' (the synapse is modified only when the
neuron gave an incorrect output), and the current activity of both presynaptic and postsynaptic neurons. However,
the convergence proof exists for unbounded synaptic weights, or for sign-constrained synaptic weights, but
not when the synaptic weights can take only a finite number of states.
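For reference, the perceptron rule with unbounded integer weights can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the patterns have already been redefined so that the desired output is always +1, starts from zero weights, and treats $I = 0$ as an error (an assumption needed here because integer weights, unlike ±1 weights with odd N, can give a zero input). The name `perceptron_train` and the `max_sweeps` cutoff are hypothetical.

```python
def perceptron_train(xis, max_sweeps=1000):
    """Sequentially present patterns; on an error, add the pattern to the weights."""
    N = len(xis[0])
    w = [0] * N
    for _ in range(max_sweeps):
        errors = 0
        for xi in xis:
            I = sum(wi * x for wi, x in zip(w, xi))
            if I <= 0:  # wrong (or undecided) output: w_i <- w_i + xi_i
                w = [wi + x for wi, x in zip(w, xi)]
                errors += 1
        if errors == 0:  # a full sweep without errors: perfect classifier
            return w
    return None  # no solution found within the cutoff

# Small linearly separable example (illustrative values).
patterns = [[1, 1, -1], [1, -1, 1], [-1, 1, 1]]
w = perceptron_train(patterns)
```

On this toy set the three updates of the first sweep produce `[1, 1, 1]`, which classifies all patterns in the second sweep.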

D. Requirements for biologically plausible learning algorithms with binary weights 

In this paper, we explore learning algorithms for binary synaptic weights. Each synapse is endowed with an
additional discrete 'hidden' variable $h_i$. This hidden variable could correspond to the state of the protein interaction
network of the post-synaptic density, which can in principle be multistable due to positive feedback loops in such
networks [3, 0]. Each synaptic weight $w_i$ will depend solely on the sign of the corresponding hidden variable $h_i$; in
the following, in order to avoid the ambiguous $h_i = 0$ state, we will always represent the hidden variables by odd
integer numbers (this simplifies the notation but does not affect the performance). We first consider the (unrealistic)
situation of an unbounded hidden variable, and then investigate learning with bounded hidden variables. Similar to
the perceptron algorithm, we seek 'on-line' algorithms (i.e. modifications are made only on the basis of the currently
presented pattern) which, at each time step $\tau$, modify synapses based only on variables available to a synapse: (i) the
current total synaptic input $I$ and hence the current post-synaptic activity; (ii) the current presynaptic activity $\xi_i^\tau$;
(iii) an error signal indicating whether the output was correct or not. At each time step, the current input pattern is
drawn randomly from the set of patterns, and the hidden variables $h_i^\tau \to h_i^{\tau+1}$ and the synaptic weights $w_i^\tau \to w_i^{\tau+1}$
are updated according to the algorithm.

E. Quantifying performance of various algorithms 

The maximal number of patterns for which a weight vector can be found is $\alpha_{max} \approx 0.83$ for random unbiased patterns.
Hence, the performance of an algorithm can be quantified by how close the maximal value of $\alpha$ at which it can
find a solution is to $\alpha_{max}$. In practice, one has to introduce a maximal number of iterations per pattern. For example,
a complete enumeration algorithm (in which one checks sequentially the $2^N$ possible synaptic weight configurations)
is guaranteed to find a solution for any $\alpha < \alpha_{max}$, but it finds it in an implausibly long time (exponentially large in
N). Here, we impose a maximal number of iterations (typically $10^4$ per pattern) and find the maximal value of $\alpha$ for
which a given algorithm is able to find a solution.
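To make the cost of complete enumeration tangible, the following sketch checks all $2^N$ binary weight vectors in turn. It is only usable for very small N and is given purely as an illustration; the name `enumerate_solution` is hypothetical, and patterns are assumed to be redefined so that the desired output is always +1.

```python
from itertools import product

def enumerate_solution(xis):
    """Return the first +/-1 weight vector giving positive input on all patterns, or None."""
    N = len(xis[0])
    for w in product((-1, 1), repeat=N):  # 2**N candidates
        if all(sum(wi * x for wi, x in zip(w, xi)) > 0 for xi in xis):
            return list(w)
    return None

patterns = [[1, 1, -1], [1, -1, 1], [-1, 1, 1]]
solution = enumerate_solution(patterns)
```

The loop visits up to $2^N$ configurations, so each additional synapse doubles the worst-case running time, in contrast with the fixed budget of iterations per pattern imposed on the on-line algorithms.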

F. Belief propagation-inspired algorithms

A modification of the Belief Propagation (BP) algorithm was found by Braunstein and Zecchina [28] to perform
remarkably well in the random binary perceptron problem. However, the BP algorithm has some features which
make it implausible from the biological point of view. In the Supporting Information, we show that with a number of
simplifications, this algorithm can be transformed into a much simpler on-line one that satisfies all the requirements
outlined above. The resulting algorithm is as follows:

BP-inspired (BPI) Algorithm

Compute $I = \xi^\tau \cdot w^\tau$, where $w_i^\tau = \mathrm{sign}(h_i^\tau)$, then
(R1) If $I > 1$, do nothing
(R2) If $I = 1$ then:

(a) If $h_i^\tau \xi_i^\tau \geq 1$, then $h_i^{\tau+1} = h_i^\tau + 2\xi_i^\tau$

(b) Else do nothing

(R3) If $I \leq -1$ then $h_i^{\tau+1} = h_i^\tau + 2\xi_i^\tau$.

FIG. 1: Schematic representation of transitions between synaptic states in the CP algorithm and the BPI algorithm. The
cascade model introduced by Fusi et al is shown for comparison. Circles represent the possible states of the internal
synaptic variable $h_i$. Grey circles correspond to $w_i = -1$, white ones to $w_i = 1$. Clockwise transitions happen when $\xi_i = 1$,
counter-clockwise when $\xi_i = -1$. Horizontal transitions are plastic (change value of synaptic efficacy $w_i$), vertical ones
meta-plastic (change internal state only). Downwards transitions make the synapse less plastic, upward ones more plastic. When
the output of the neuron is erroneous, $\xi \cdot w < 0$: transitions occur to the nearest neighbor internal state. In the CP algorithm,
when the output is correct, $\xi \cdot w > 0$: no transitions occur. In the BPI algorithm, when the output is barely correct, $\xi \cdot w = 1$
(a single synaptic flip could have caused an error): transitions are made towards less plastic states only. When the output is
safely correct, $\xi \cdot w > 1$: no transitions occur. In the cascade model, 'down' transitions are towards nearest neighbors, while
'up' transitions are towards the highest state with opposite sign. Transition probabilities decrease with increasing $|h|$, see [12]
for more details.

These rules can be interpreted as follows. (R1) As $I > 1$, the synaptic input is sufficiently above threshold, such
that a single synaptic (or single neuron) flip would not affect the neuronal output; therefore all variables are kept
unchanged. (R2) As $I = 1$, the synaptic input is just above threshold (a single synaptic or single neuron flip could
have potentially brought it below threshold), so some of the hidden variables need to be changed. The variables
that are changed are those that were going in the right direction, i.e. those that contributed to having the synaptic
input go above threshold. Finally, for (R3), $I < 0$, so the output is incorrect and all hidden variables need to be
changed. The factor of 2 included in rules R2 and R3 guarantees that the hidden variables will still be odd when
updated if they are initialized to be so.
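The three rules can be written as a single update function. The sketch below is an illustration under stated assumptions rather than the authors' implementation: patterns are taken to be redefined so that the desired output is +1, the condition in rule R2(a) is read as $h_i \xi_i \geq 1$ (equivalently, synapse $i$ contributed positively to the input), and the name `bpi_step` is hypothetical.

```python
def sign(h):
    return 1 if h > 0 else -1

def bpi_step(h, xi):
    """Apply one BPI update to the odd-integer hidden variables h, in place.

    Returns the label of the rule that was applied.
    """
    I = sum(sign(hi) * x for hi, x in zip(h, xi))
    if I > 1:                      # R1: safely correct, nothing changes
        return "R1"
    if I == 1:                     # R2: barely correct, metaplastic changes only
        for i, x in enumerate(xi):
            if h[i] * x >= 1:      # synapse pushed the input in the right direction
                h[i] += 2 * x      # move away from zero: weight sign is unchanged
        return "R2"
    for i, x in enumerate(xi):     # R3: error, all hidden variables move
        h[i] += 2 * x
    return "R3"
```

Since updates are always by ±2, hidden variables that start odd stay odd, and rule R2 can never flip a weight: it only increases $|h_i|$ for the synapses it touches.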

Note that this algorithm has two distinct features compared to the perceptron algorithm: (i) Hidden variables obey 
update rules that are similar to those of the SP algorithm, but the actual synaptic weight is binary; (ii) One of the 
update rules, rule R2 (corresponding to a synaptic input just above threshold), is new compared to SP. 

To investigate the effect of rule R2 on performance, we also simulated a stochastic version of the BPI algorithm,
in which such a rule is only applied with probability $p_s$ for each presented pattern:

Stochastic BP-inspired (SBPI) Algorithm

As BPI, except rule R2 is replaced by:
(R2) If $I = 1$, then:

(a) with probability $p_s$:

i. If $h_i^\tau \xi_i^\tau \geq 1$, then $h_i^{\tau+1} = h_i^\tau + 2\xi_i^\tau$

ii. Else do nothing

(b) with probability $1 - p_s$, do nothing

When the parameter $p_s$ is set to 1, one recovers the deterministic BPI algorithm, while setting it to 0 (thus, in
fact, removing rule R2) transforms it into a 'clipped perceptron algorithm' (CP), i.e. a perceptron algorithm but
with clipped synaptic weights. Both BPI and CP algorithms are sketched in Fig. 1.
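A compact training loop for the stochastic variant might look as follows. This is a hedged sketch: the function names, the random ±1 initialization of the hidden variables, and the use of shuffled sweeps (instead of drawing one random pattern per time step) are assumptions made here for brevity. Setting `p_s=1.0` gives BPI and `p_s=0.0` gives CP.

```python
import random

def sign(h):
    return 1 if h > 0 else -1

def sbpi_step(h, xi, p_s, rng):
    """One SBPI update in place; returns True if the pattern was misclassified."""
    I = sum(sign(hi) * x for hi, x in zip(h, xi))
    if I > 1:                                   # R1
        return False
    if I == 1:                                  # R2, applied with probability p_s
        if rng.random() < p_s:
            for i, x in enumerate(xi):
                if h[i] * x >= 1:
                    h[i] += 2 * x
        return False
    for i, x in enumerate(xi):                  # R3
        h[i] += 2 * x
    return True

def sbpi_train(xis, p_s, max_sweeps, rng):
    """Return (h, sweeps) on success or (h, None) if the cutoff is reached."""
    h = [rng.choice((-1, 1)) for _ in range(len(xis[0]))]
    order = list(range(len(xis)))
    for sweep in range(1, max_sweeps + 1):
        rng.shuffle(order)
        errors = sum(sbpi_step(h, xis[a], p_s, rng) for a in order)
        if errors == 0:                         # weights did not change this sweep
            return h, sweep
    return h, None
```

A sweep without errors certifies a perfect classifier because rule R2 never changes the sign of a hidden variable, so the weights are constant throughout such a sweep.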
FIG. 2: Performance of the BPI algorithm with unbounded hidden variables: A-C convergence time vs. $N$ for different values
of $\alpha$ (indicated on each graph). Points correspond to the number of iterations per pattern until the algorithm converges, averaged
over 200 pattern sets; vertical bars are standard deviations. Dotted lines: CP, Solid lines: BPI, Dashed lines: SBPI with
$p_s = 0.3$. The latter is the only one which can reach $\alpha = 0.6$, but performs worse than BPI for $\alpha < 0.3$ (it is absent from panel
A for clarity). D. Probability that the BPI algorithm learns perfectly $0.3 \cdot N$ patterns in less than $T = x \cdot (\log N)^{1.5}$ iterations
per pattern vs $x$ for various values of $N$.



G. Performance with unbounded variables 



The performance of both deterministic and stochastic versions of the BPI algorithm was first investigated numerically
with unbounded hidden variables, for different values of $\alpha$, $N$ and $p_s$. It turns out that SBPI performs
remarkably well, provided the probability $p_s$ is chosen appropriately: with $p_s \approx 0.3$ the system can reach a capacity
of order 0.65 with a convergence time that increases with $N$ in a sub-linear fashion (see Fig. 2). On the other hand,
the deterministic BPI ($p_s = 1$) has a significantly lower capacity ($\alpha \approx 0.3$), but for those lower values of $\alpha$ it performs
significantly faster than the SBPI algorithm: for $\alpha = 0.3$ the time increases approximately as $(\log N)^{1.5}$, as shown in
Fig. 2D. As an example, the algorithm perfectly classifies 38400 patterns with 128001 synapses with only around 35
presentations of each pattern. When rule R2 is eliminated completely (i.e. CP), convergence time becomes exponential
in N rather than logarithmic, for every tested value of $\alpha$, as shown by the supralinearity of the blue curves in Fig. 2.
Hence, the specificity of rule R2 with respect to synapses (only synapses that actually went in the right direction for
the current pattern should be modified) is a crucial feature which makes the BPI algorithm qualitatively superior.
Moreover, the convergence time increases only mildly with $\alpha$, as shown in Fig. 2.

We also find that there is a tradeoff between convergence speed and capacity: for each value of $\alpha$, there is an optimal
value of $p_s$ that minimizes average convergence time (shown in the Supporting Information, Fig. 6). This optimal
value decreases with $\alpha$; for $\alpha = 0.3$ it is close to 1, and decreases to 0.3 at $\alpha = 0.65$. Hence, decreasing $p_s$ enhances the
capacity, at the cost of a slower convergence; nevertheless Fig. 2C shows that for values of $\alpha \leq 0.60$ SBPI ($p_s = 0.3$)
learns perfectly the set of input/output associations in a time that scales sub-linearly with N. For $\alpha > 0.7$ the
algorithm fails to solve instances in a time shorter than the chosen cutoff time of $10^4$. Note that for $p_s = 0.3$ the
convergence time depends in a more pronounced way on $\alpha$ than in the $p_s = 1$ case.

We have also investigated an algorithm in which ps is itself a dynamical variable that depends on the fraction of 
errors averaged over a long time window - such an algorithm with an adaptive ps is able to combine faster convergence 
at low values of a with high capacity associated with low values of ps (not shown) . 

H. Performance with bounded hidden variables 

We now turn to the situation when there is only a limited number of states K of the hidden variables $h_i$, since
it is unrealistic to assume that a single synapse can maintain an arbitrarily large number of hidden states. Thus,
we investigated the performance of an algorithm with symmetrical hard bounds on the values of the hidden states,
$|h_i| \leq K - 1$ for all $i$.
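Hard bounds can be added to the update by clipping each hidden variable after it is modified. The following sketch is illustrative only: it assumes K is even, so that the odd integers between $-(K-1)$ and $K-1$ give exactly K states, and the names `clip_hidden` and `bounded_bpi_step` are hypothetical.

```python
def sign(h):
    return 1 if h > 0 else -1

def clip_hidden(value, K):
    """Keep a hidden variable inside the hard bounds |h| <= K - 1."""
    bound = K - 1
    return max(-bound, min(bound, value))

def bounded_bpi_step(h, xi, K):
    """BPI update (rules R1-R3) with hard-bounded hidden variables, in place."""
    I = sum(sign(hi) * x for hi, x in zip(h, xi))
    if I > 1:                                   # R1: nothing to do
        return
    for i, x in enumerate(xi):
        # R3 (I <= -1) touches every synapse; R2 (I == 1) only the helpful ones.
        if I <= -1 or h[i] * x >= 1:
            h[i] = clip_hidden(h[i] + 2 * x, K)

# With K = 4 the allowed hidden states are -3, -1, 1, 3.
states = list(range(-(4 - 1), 4, 2))
```

A synapse already at the bound simply stays there when pushed further, so it remains at most $K/2$ error-driven steps away from flipping its weight.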

Figure 3 shows what happens when the number of internal states is kept fixed while varying N. For the number of
states we have considered ($10 < K < 40$), the optimal value of $p_s$ is 1, since in general the stochastic version of the
algorithm requires a larger number of states to be efficient. Here, we defined the capacity as the number of patterns
for which there is 90% probability of perfect learning in $10^4$ iterations, and plotted in Fig. 3 the corresponding critical
$\alpha$ against N for different values of the number of states K, comparing BPI, CP, and the cascade model (defined as in
Fig. 1). We also compared these algorithms, which have only 2 'visible' synaptic states but K hidden states, with the
SP algorithm with K 'visible' states, $w_i = h_i$.

It turns out that BPI achieves a higher capacity than the SP algorithm with K visible states, when K is fixed 
and N is sufficiently large, even though the maximal capacity of the binary device is lower. Interestingly, adding an 
equivalent of rule R2 to the SP algorithm allows it to overcome BPI. This issue is further discussed in the Supporting 
Information. 

It is also interesting to note that at very low values of N, performance is better using 20 states than with an infinite 
number of states. Intuitively, this may be due to the fact that in the unbounded case some synapses are pushed too 
far and get stuck at high values of hi, i.e. they lose all their plasticity, while a solution to the learning problem would 
require them to come back to the opposite value of Wi. 

The last panel in Fig. 3 compares how convergence time changes with $\alpha$ for the same four algorithms, with the
same number of synapses and same number of states per synapse: while the cascade model has a clear exponential
behavior, the BPI and SP algorithms maintain nearly constant performance almost up to their critical point. The
CP algorithm is somewhat in between, its performance degrading rapidly with increasing $\alpha$ (note the logarithmic
scale).

Following the observation that an appropriate number of internal states K can increase BPI capacity, we searched
for the value of K that optimizes capacity, and found that it scales roughly as $\sqrt{N}$ (see the corresponding section and
Fig. 8 in the Supporting Information); this is consistent with the observation that the distribution of hidden states
scales as $\sqrt{N}$ (also discussed in the Supporting Information, see Fig. 7). The fact that the capacity is optimal for a
finite value of K makes the BPI algorithm qualitatively different from the other three, whose performance increases
monotonically with K.

For a system with a number of states that optimizes capacity, the optimal value for $p_s$ is 0.4, rather than 0.3 as
in the unbounded case. With these settings it is possible to reach a capacity $\alpha_c$ of almost 0.7 bits per synapse, very
close to the theoretical limit $\alpha_{max} \approx 0.83$. Convergence time at high values of $\alpha$ scales roughly linearly with N, but
with a very small prefactor (si 2 • 10~^). 

III. BINARY 0,1 NEURONS AND SYNAPSES, SPARSE CODING 
A. The 0,1 model neuron 

±1 neurons with dense coding (equal probability of +1 or −1 inputs) have the biologically implausible feature of
symmetry between the two states of activity.

A first step towards a more biologically plausible system is to consider the situation in which both the synaptic
weights $w_i$ and neurons are 0,1 binary variables, and the inputs are $\xi_i^a = 1$ with probability $f$, and 0 with probability
$(1 - f)$, where $f \in [0, 0.5]$ is the 'coding level'. In this case, we need to take a non-zero threshold $\theta > 0$. In the
following, we choose the threshold to be $\approx 0.3Nf$ (see Supporting Information for details). We consider each input
pattern $a$ to have a desired output $\sigma^a = 1$ with probability $f$ and $\sigma^a = 0$ with probability $(1 - f)$.
The new classification problem amounts to finding a vector $w$ which satisfies the p equations:

$\sigma^a = \Theta\left( \sum_{i=1}^{N} w_i \xi_i^a - \theta \right)$ for $a = 1, \dots, p$, (2)

where $\Theta$ is the Heaviside step function.

FIG. 3: Performance of various algorithms with hard-bounded hidden variables. Triangles: BPI, squares: CP, circles: SP,
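The 0,1 version of the task can be illustrated with a small sketch. The helper names are hypothetical, and the step function is taken to be 1 only for strictly positive arguments, matching the convention that the neuron is active when its input is larger than the threshold.

```python
import random

def heaviside(x):
    """Step function: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

def sparse_patterns(p, N, f, rng):
    """Draw p patterns with xi_i = 1 with probability f, and desired outputs likewise."""
    xis = [[1 if rng.random() < f else 0 for _ in range(N)] for _ in range(p)]
    sigmas = [1 if rng.random() < f else 0 for _ in range(p)]
    return xis, sigmas

def satisfies_eq2(w, xis, sigmas, theta):
    """Check Eq. (2): sigma^a = Theta(sum_i w_i xi_i^a - theta) for all a."""
    return all(
        heaviside(sum(wi * x for wi, x in zip(w, xi)) - theta) == s
        for xi, s in zip(xis, sigmas)
    )
```

With the choice quoted in the text, one would call `satisfies_eq2(w, xis, sigmas, theta=0.3 * N * f)`.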



B. The optimized algorithm 

The BP scheme can be straightforwardly applied to the 0,1 perceptron (see Supporting Information for details);
the resulting BPI algorithm is very similar to the one presented above, with two major differences: (i) The quantity
to be evaluated at each pattern presentation is not the total input $I = \sum_i w_i^\tau \xi_i^\tau$, but rather the 'stability parameter'
$\Delta = (2\sigma^\tau - 1)(I - \theta)$, which is positive if the pattern is correctly classified and negative otherwise; (ii) Synaptic
weights are now computed as $w_i^\tau = \frac{1}{2}(\mathrm{sign}(h_i^\tau) + 1)$, making the synapse active (inactive) if the hidden variable is
positive (negative), respectively. The 0,1 algorithm is then the same as the one for the ±1 case, in which $\Delta$ replaces
$I$. The performance of this algorithm is qualitatively very similar to the one for the ±1 case, with a lower capacity of
about 0.25, to be compared with a theoretical limit of 0.59 [30].

We have explored variants of the basic BPI algorithm. In particular, we have studied a stochastic version of the
algorithm in which rule R2 is applied with probability $p_s$, but only for those patterns which require $\sigma^\tau = 0$. This
SBPI01 algorithm consists in:

SBPI01 Algorithm

Compute $I = \xi^\tau \cdot w^\tau$, where $w_i^\tau = \frac{1}{2}(\mathrm{sign}(h_i^\tau) + 1)$, and $\Delta = (2\sigma^\tau - 1)(I - \theta)$, then
(R1) If $\Delta \geq \theta_m = 1$, then do nothing
(R2) If $0 \leq \Delta < \theta_m = 1$, then

(a) If $\sigma^\tau = 0$, with probability $p_s$, if $w_i^\tau = 0$, then $h_i^{\tau+1} = h_i^\tau - 2\xi_i^\tau$

(b) Else do nothing.

(R3) If $\Delta < 0$, then $h_i^{\tau+1} = h_i^\tau + 2\xi_i^\tau (2\sigma^\tau - 1)$.

where we have introduced $\theta_m$, the threshold for applying rule R2. Since rule R2 is only applied to patterns with
zero output $\sigma^\tau$, the metaplastic changes affect only silent synapses (for which $w_i^\tau = 0$) involved in the pattern (those
for which $\xi_i^\tau = 1$). Note that using rule R2 only for patterns for which $\sigma^a = 0$ not only optimizes performance, but
also makes the algorithm simpler, since in this way there is only the need for one secondary threshold ($\theta - \theta_m$) instead
of two (which would have been required if rule R2 had to be applied in all cases). The opposite choice, i.e. using rule
R2 only for patterns for which $\sigma^a = 1$, can also be taken with similar results.
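One update of the 0,1 algorithm can be sketched as below. This is a hedged reading of the rules, assuming rule R2 covers $0 \leq \Delta < \theta_m$ and updates silent synapses by $-2\xi_i$; the function name `sbpi01_step` and its signature are hypothetical.

```python
import random

def sbpi01_step(h, xi, sigma, theta, p_s, rng, theta_m=1.0):
    """One SBPI01 update in place; returns the label of the rule that matched."""
    w = [1 if hi > 0 else 0 for hi in h]           # w_i = (sign(h_i) + 1) / 2
    I = sum(wi * x for wi, x in zip(w, xi))
    delta = (2 * sigma - 1) * (I - theta)          # stability parameter
    if delta >= theta_m:                           # R1: safely correct
        return "R1"
    if delta >= 0:                                 # R2: barely correct
        if sigma == 0 and rng.random() < p_s:
            for i, x in enumerate(xi):
                if w[i] == 0:                      # silent synapses only
                    h[i] -= 2 * x                  # changes only where xi_i = 1
        return "R2"
    for i, x in enumerate(xi):                     # R3: error
        h[i] += 2 * x * (2 * sigma - 1)
    return "R3"
```

For a pattern with $\sigma = 0$, rule R2 is reached when $\theta - \theta_m < I \leq \theta$, which is the single secondary threshold mentioned above.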

As in the preceding case, introducing boundaries for the hidden variables $h_i$ can further improve performance, and
the number of states K which maximizes capacity scales again roughly as $\sqrt{N}$ (shown in the Supporting Information,
Fig. 8). In the case of dense coding, $f = 0.5$, and using the optimal value $p_s = 0.4$, SBPI01 can reach a storage
capacity beyond 0.5 bits per synapse for sufficiently high $N$, very close to the maximum theoretical value $\alpha_{max} \approx 0.59$.



C. Heterogeneous synapses and sparse coding 

One possible way to increase capacity with a very limited number of available states is to use 'sparse' coding, i.e.
a low value for $f$. In an unsupervised learning scenario, it has been shown that purely binary synapses (e.g. only two
hidden states) can perform well if $f$ is chosen to scale as $\log N / N$. Here, we chose an intermediate scaling
$f = 1/\sqrt{N}$. In addition, we also introduced heterogeneity in synaptic efficacies. Possible synaptic weights were no
longer 0 and 1, but 0 and $x$, where $x$ was drawn from a Gaussian distribution with mean 1 and standard deviation 0.1.
Likewise, the threshold $\theta_m$ used for the implementation of rule R2 was drawn randomly at each pattern presentation
from a Gaussian distribution centered at 1 with variance 0.1. The resulting algorithm, SBPI-Het, was shown to have
very similar performance to SBPI01 in the $f = 0.5$ case.
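The ingredients of the heterogeneous variant can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: each synapse is given its own efficacy, drawn once and kept fixed, and all function names are hypothetical.

```python
import math
import random

def coding_level(N):
    """Intermediate sparse scaling f = 1 / sqrt(N)."""
    return 1.0 / math.sqrt(N)

def draw_efficacies(N, rng, mean=1.0, sd=0.1):
    """One 'on' efficacy per synapse, Gaussian with mean 1 and standard deviation 0.1."""
    return [rng.gauss(mean, sd) for _ in range(N)]

def heterogeneous_weights(h, efficacies):
    """A synapse contributes its efficacy if h_i > 0, and 0 otherwise."""
    return [x if hi > 0 else 0.0 for hi, x in zip(h, efficacies)]

def draw_theta_m(rng, mean=1.0, variance=0.1):
    """Threshold for rule R2, redrawn at each presentation (variance 0.1)."""
    return rng.gauss(mean, math.sqrt(variance))
```

Note that `random.Random.gauss` takes a standard deviation, so the variance quoted for $\theta_m$ is converted with a square root.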

In Fig. we show the maximum capacity $\alpha_c$ (defined as for Fig. 3) reached in the sparse coding case divided
by the maximum theoretical value $\alpha_{max}$ (which depends on $f$), with $p_s = 1$, N ranging from 1000 to 64000 and a low
number of internal states. The figure shows that a synapse with only two states (i.e. with no metaplasticity) has
a capacity of only about 10% of the maximal capacity in the whole range of N investigated. Adding hidden states
up to $K = 10$ significantly improves the performance, which reaches about 70% of the maximal capacity for sizes
of N of order 10000. In fact, for such values of N the capacity decreases when one further increases the number of
states. The optimal number of states increases with N as in the dense coding case, but with a milder dependence on
N. In fact, simple arguments based on unsupervised application of rule R2 predict in this case an optimal number
of states scaling as N-^^^/y/logN, which seems to be roughly consistent with our numerical findings. Fig. [3J3 shows
convergence time versus $\alpha$ for $N = 64000$. It demonstrates again the speed of convergence of the SBPI algorithm,
while the cascade model is significantly slower.

IV. ROBUSTNESS AGAINST NOISE 

Binary devices have the advantage of simplicity and robustness against noise. Here we briefly address the issue 
of resistance against noise which might affect the multi-stable hidden states. Intuitively, the fact that the synaptic 
weights in the BPI algorithm only depend on the sign of the corresponding hidden variables, suggests that a device
implementing such learning scheme would be more resistant against accidental changes in the internal states with 
respect to a device in which the multi-stable state is directly involved in the input summation. We verified this 
by comparing a perceptron with binary synapses and K hidden states implementing the SBPI algorithm with a 
perceptron with synapses with K visible states implementing the SP algorithm, both in the unbounded and in the 
bounded cases. The protocols used for testing robustness and the corresponding results are presented in the Supporting 
Information. 

In all the situations we tested, we found a pronounced difference between the two devices, confirming the advantage 
of using binary synapses in noisy environments or in presence of unreliable elements. 



V. DISCUSSION 



In this paper, we have shown that simple on-line supervised algorithms lead to very fast learning of random
input-output associations, up to close to the theoretical capacity, in a system with binary synapses and a finite number of
hidden states. The performance of the algorithm depends crucially on a rule which leads to synaptic modifications
only if the currently shown pattern is 'barely learned', that is, if a single synaptic flip would lead to an error on
that pattern. In this situation, the rule requires the synapse to have metaplastic changes only. Only synapses that
contributed to the correct output need to change their hidden variable, in the direction of stabilizing the synapse
in its current state. This rule originates directly from the Belief Propagation algorithm. We have shown that this
addition allows the BPI algorithm to learn a fraction of bits of information per synapse with at least roughly an order
of magnitude fewer presentations per pattern than any other known learning protocol, already at moderate system sizes
and moderate values of $\alpha$. Furthermore, for a neuron with about $10^4$ - $10^5$ synapses, when $\alpha \in [0.3, 0.6]$, the BPI
algorithm finds a solution with a few tens of presentations per pattern, while the CP algorithm is unable to find such
a solution in $10^4$ presentations per pattern. Finally, we showed that this algorithm renders a model with only two
visible synaptic states and K hidden states much more robust to noise than a model with K visible states.

Other recent studies have considered the problem of learning in networks with binary synapses. Senn and Fusi
[20] introduced a supervised algorithm that is guaranteed to converge for an arbitrary set of linearly separable
patterns, provided there is a finite separation margin between the two classes. For sets of random patterns, this
last requirement limits learning to a number of patterns which does not increase with N. Fusi et al [12] introduced
a model that bears similarity with the model we consider (binary synapses with a finite number of hidden states),
with unsupervised transitions between hidden states. We have shown here that a supervised version of this algorithm
performs significantly worse than the BPI algorithm.

Since the additional simple rule R2 has such a spectacular effect on performance, we speculate that neurobiological
systems that learn in the presence of supervision must have found a way to implement such a rule. The prediction is
that when a system learns in the presence of an 'error' signal, and synaptic changes occur in the presence of that signal,
metaplastic changes should occur in the absence of the error signal, but when the inputs to the system are very close
to threshold. After exposure to such an input, it should be more difficult to elicit a visible synaptic change, since the
synaptic hidden variables take larger values.

The fact that the algorithms developed here are digital during retrieval and that discrete (even noisy) hidden 
variables are only needed during learning could also have implications in large-scale electronic implementations, in 
which the overhead associated with managing and maintaining multi-stable elements in a reliable way may be of 
concern. 